Feature #117

Import Occupations List into Drupal

Added by David Lanier over 1 year ago. Updated about 1 year ago.

Status:Resolved Start date:12/15/2011
Priority:Normal Due date:
Assignee:Andrew Mallis % Done:

100%

Category:development Spent time: 12.50 hours
Target version:0.6.0 Estimated time:25.00 hours
Difficulty:average

Description

Determine basic technique. Top candidates currently are the Feeds module or the Migrate module

Implement the technique, including testing the data source (probably csv initially), and mapping source fields to Drupal fields.

Repeated runs should be smart, leaving unchanged data alone, adding and updating only data that has been added or updated.

See #97 for example datasets, as well as #110.


Related issues

related to #OWSdirectory - Task #97: Create master spreadsheet and import/merge data from all ... Resolved 01/23/2012 02/10/2012

History

Updated by David Lanier over 1 year ago

Installed feeds module and configured a test import, using a csv download of "Occupy Spreadsheet": https://docs.google.com/spreadsheet/ccc?key=0An-A-lITWCO8dHZSVkRjMk5MRU91MXFzRVpSWjRtWGc&hl=en_US#gid=0. It seems to have worked great.

Next steps are to finalize the data structure so that the content type can be built out, then we can do this mapping and real import.

If we want to enable updating nodes later (instead of deleting and replacing) we will need a field in the source to use as a GUID. Possibly a letter (to differentiate one source spreadsheet from another) followed by a sequential number. So this first spreadsheet might have rows identified as A0001-A0489, for example.

Updated by Andrew Mallis over 1 year ago

  • % Done changed from 0 to 30

This is awesome. Seems to have worked smoothly.

We'll be merging the spreadsheet you drew from with the others into this (currently empty) master directory:
https://docs.google.com/spreadsheet/ccc?key=0Aldd4WFocycHdDNXQ25ZRGllajNxWkdWQ1FIR210Wnc
which will get populated soon. Fields there aren't final either, though.

Updated by Andrew Mallis over 1 year ago

  • Status changed from New to In Progress
  • % Done changed from 30 to 20
  • Estimated time set to 25.00

The master doc has been updated to reflect what we feel is a reasonable data import.
(There will need to be some light modifications to the Occupation. Will update that task).

We've created 2 records for test import. 1 of them has bad data, just for fun.

Updated by David Lanier over 1 year ago

I've added feeds_tamper and mapped a few fields to test the importing of the main template spreadsheet (http://projects.occupy.net/issues/113#note-2). In short, it does work great. see http://dev.fga-directory.gotpantheon.com/admin/structure/feeds/edit/occupation_csv/mapping for the mapping.

To test multivalue fields, I used the Facebook pages field.

To test updates, I mapped the id field of the spreadsheet to the GUID field that Feeds offers. (I'm not sure exactly what is providing this GUID field, whether it's core, Feeds, or what.) With that field set as the unique target, and this page http://dev.fga-directory.gotpantheon.com/admin/structure/feeds/edit/occupation_csv/settings/FeedsNodeProcessor saying to update existing nodes, it does actually update when you make a small change to the spreadsheet and re-import. For the multi-value fields, the whole field (all values) gets replaced with the new field, the new set of values.

What remains to be done now (development wise) is to complete the mapping of all source fields of the spreadsheet to fields of the occupation content type.

Updated by David Lanier over 1 year ago

The address field is not currently mappable in feeds. I believe this issue http://drupal.org/node/1023068 has (or points to) the solution, but I have not tested it yet.

Updated by Andrew Mallis over 1 year ago

  • Status changed from In Progress to Feedback
  • Assignee changed from David Lanier to Marco Carbone
  • Priority changed from Normal to High
  • % Done changed from 20 to 50

Marco, can you please jump in here and work with David on this task?

Updated by Marco Carbone over 1 year ago

  • Status changed from Feedback to In Progress
  • % Done changed from 50 to 60

OK, I applied the patch to allow address integration with feeds (the patch was to the feeds module not to address fields). I've added those mappings and it seems to work great. I'll finish up the rest of the mappings tomorrow evening.

However, there are a few issues regarding what format feeds is expecting the data to be in:

I'm seeing that some rows in the spreadsheet use "(na)" when there is no state/province/terr -- I think those should probably be blanked out.

Updated by Marco Carbone over 1 year ago

  • Status changed from In Progress to Feedback
  • Assignee changed from Marco Carbone to Andrew Mallis
  • % Done changed from 60 to 90

The rest of the mappings are in place, and everything came in nicely from the template spreadsheet. e.g., http://dev.fga-directory.gotpantheon.com/occupation/occupy-wall-street

The one change I had to make was to add 'US' as the country to Occupy Atlanta.

So assuming you all are cool with conforming to the address limitations I mentioned in my last post here, then we're good to go. If you need more flexibility there because the data is messy, assign back to me and I can write a new plugin to do more complex handling of the country column.

Updated by Andrew Mallis over 1 year ago

  • % Done changed from 90 to 70

We'll massage the data for you. Great work.

bumping %done down a bit, 'cause it will take a while probably to cycle through the real list(s) once complete.

Updated by Marco Carbone over 1 year ago

  • Status changed from Feedback to In Progress
  • Assignee changed from Andrew Mallis to Marco Carbone
  • % Done changed from 70 to 60

Having some issues with how feeds imports into taxonomy fields for the twitter hashtags field. Taking this back.

Updated by Marco Carbone over 1 year ago

  • Assignee changed from Marco Carbone to Andrew Mallis
  • % Done changed from 60 to 70

OK, that's working now. Back to Andrew.

Updated by Andrew Mallis over 1 year ago

  • Category set to development
  • Assignee changed from Andrew Mallis to David Lanier

Here is the data source:
https://docs.google.com/spreadsheet/ccc?key=0Aldd4WFocycHdEdCVEtBNzFSZ0NUQlpBR1pBN2xON0E

the MASTER sheet should be targeted for import. We need to merge the 4th man data set into that. Forthcoming.

We're cleaning up the spreadsheet a bit, and standardizing our state codes.

I'm hoping to bring the rest of the movement onto this sheet. An inter-occ meeting is in the works for early next week.

Updated by Andrew Mallis over 1 year ago

  • Priority changed from High to Urgent

Let's get a bit more data into the dev instance, so we can troubleshoot our data format better.

Updated by Marco Carbone over 1 year ago

  • Assignee changed from David Lanier to Marco Carbone

Updated by Marco Carbone over 1 year ago

  • Status changed from In Progress to Feedback
  • Assignee changed from Marco Carbone to Andrew Mallis

OK, I imported import-master-4th and there are now 1923 items in there. I'm seeing a lot of dupes, etc. but I presume this is still in need of clean-up.

In order to get a successful import, I had to delete about 27 rows that had non-two-letter country abbrevations (e.g., 'global', 'intl', etc.)

Also, because there were no ID's, I had to remove ID from the mapping. If we are intending to do re-imports to update information, I think we'll need that ID, so it's probably worth making sure those get added to the first column. It's OK if they don't get added, but then it will be a one-time import deal.

Updated by Marco Carbone over 1 year ago

For some reason everything got imported with the Occupation founded date set to the current date. I'll look into figuring out why this is happening.

Updated by Marco Carbone over 1 year ago

OK, fixed the date issue with a patch and did another import.

Updated by Andrew Mallis over 1 year ago

Awesome. Thanks for fixing the date issue. We're still missing a lot of that information; it's in another source right now (the Guardian's).

Please import from the FGA Occupation MASTER Directory document, the sheet called "MASTER LIST" instead of the

I've added an additional column after the Zip called "Location Name" to hold data like "Liberty Plaza".

What should we be doing with these IDs? Can we realistically match to them and update on import, or is this relegated to our internal spreadsheet merge?

Updated by Andrew Mallis over 1 year ago

.

Updated by Marco Carbone over 1 year ago

I added the Location Name mapping.

Once country names are all in their full versions in the Master List, I can delete and run another import. Let me know when that's the case.

As for ID's, they are only useful if you intend to do later imports after the "real" one that updates information, and then ID's are used as the unique identifier. I haven't actually tested that yet. Should I, or do you think it won't be needed?

Updated by David Lanier over 1 year ago

I have tested imports based on IDs, and they did work. That was with only a couple records and a few fields, so it would be worth testing updates again if updates are needed.

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from Andrew Mallis to Marco Carbone

I've reset all the country names to full spelling.

I've set the state codes for US, Italy, Brazil, Canada, and Australia. Took some doing.

Please run another import.

I'll bring the ID issue up at the data meeting on Wed. No need to implement it unless we agree to that workflow. I'm feeling it's maybe 50-50 right now.

Updated by Marco Carbone over 1 year ago

  • Assignee changed from Marco Carbone to Andrew Mallis

Latest import has been done.

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from Andrew Mallis to Marco Carbone

I added another field in the spreadsheet for additional contact emails.
Import wil need some adjustment
Donate information in the data isn't limited to a single URL. I have added a textfield to the content type to handle the next import. Post import, the URL field can be deleted.

Data-wise, a bunch more de-duping has taken place, and occupations that are non-physical have been mostly separated. More to do still, but a new import will make things look better.

Updated by Marco Carbone over 1 year ago

  • Assignee changed from Marco Carbone to Andrew Mallis

OK, I updated the feed mappings to include the new email address field and the new donate field, and ran a new import on the latest spreadsheet.

I didn't delete the donate URL field yet as I think the changes to the export views to use the new fields still need to be made.

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from Andrew Mallis to Marco Carbone
  • % Done changed from 70 to 80

Please correct the export views (now a Feature) and delete the donate URL field.

Please re-run an import of the lastest MASTER sheet, which now contains 0 dupes!

Updated by Marco Carbone over 1 year ago

  • Assignee changed from Marco Carbone to Andrew Mallis

Re-ran the import. You probably got a bunch of emails for them from the rules notifications -- sorry about that. I had it set not to send the notification if the node author was anonymous, but that didn't seem to work. I modified it to a positive condition, only emailing if the node author is authenticated. Hope that works next time.

Two rows failed because their donate info value was longer than 128 characters, the max value set for that field. It's a pain to change that value, as it can't be done via the UI unless you delete all content. Do you want that to happen, or can you just update those two rows? (Not sure which two it is, but it would be the two with the longest values.)

Note that I'm currently not importing the ID column because it is missing for some values, and feeds complains if there are missing files for its GUID. Once we do a final import, we can make sure to have that field enabled for the import, assuming all rows have a value for it.

Updated by Marco Carbone over 1 year ago

It looks like the geocoding didn't run during this import. I think maybe some modules were updated without reapplying patches? I'm investigating. I put patches in sites/all/hacks -- does Pantheon have another approach for that?

Updated by Marco Carbone over 1 year ago

Yep, geocoder was updated without the patch. All other patches modules seem fine. Patch is no longer applying so I will have to address that.

Updated by Marco Carbone over 1 year ago

OK, reran import and geocoding worked this time. I also modified the occupation content type comment setting so that new occupations have comments open, as it was set to be closed by default.

We should discuss the module update process on our next hangout. It'd be ideal if each contrib module was checked out from git, but that requires git submodules and can be a bit of a headache.

Updated by Andrew Mallis over 1 year ago

Awesome.

I'm going to keep this ticket open; easier than creating new ones for each import refresh.

Sorry about geocoder. That was me. Forgot about the patch.

re: submodules I am happy to talk about this, bue maybe at Thursdays's meet? Tues, we'll likely be putting out fires. Just got to reading Randy Fay's blog article on this very subject last week.

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from Andrew Mallis to Marco Carbone

I've added a new field to the Occupation content type that we'll be wanting to import. It's a taxonomy (also added this to the Occupations feature) called occupation_type.

I anticipate we'll be needing to deal with the physical/non-physical occupation question sooner rather than later.

Can you please add this field to the import and run another pass off the MASTER list?

We've spent many hours merging in the information from The Guardian and now how have start dates for a lot of occupations and more specific geo-data.

Updated by Joel Farris over 1 year ago

Andrew, if we're interested in capturing more than simply "Does this Occupation have a physical presence?" then I agree that a taxonomy vocabulary is appropriate.

If, however, the answer to this question is a simple YES or NO, then I would advise that a boolean checkbox is far more appropriate to hold this bit of data; taxo vocabs are designed for more structured, semantic, translatable categorization.

Updated by Andrew Mallis over 1 year ago

The boolean on a physical presence is still appropriate.
The taxonomy categories will be things like: University, National, State-wide.

There are physical occupations that do not have a physical presence because they were evicted. That doesn't make them non-physical, though.

Updated by Marco Carbone over 1 year ago

  • Assignee changed from Marco Carbone to Andrew Mallis

Ran another import after adding the new taxonomy field.

Still getting two import failures because the donate_info data is too long. I also still have the unique checkbox off for the ID->GUID mapping because there are about 100 rows without an ID.

Here are the import steps:
1) Delete all existing nodes except for Occupy Wall Street. This can be done with VBO, but for now I've been doing it at /devel/php with:

set_time_limit(0); $nids = array(); $result = db_query("SELECT nid FROM {node} WHERE type = 'occupation' and created > 1327035600"); foreach ($result as $record) { if (is_numeric($record->nid)) { $nids[] = $record->nid; } } node_delete_multiple($nids);

2) Edit occupations content type (/admin/structure/types/manage/occupation/edit) and check the published box so imported occupations are published. You could also do this with VBO.

3) Run the import at import/occupation_csv

4) Edit the occupations content type and uncheck the published box (not needed if you are using VBO to publish the newly imported occupations)

Updated by Andrew Mallis over 1 year ago

Sorry about that. We are currently jamming on the spreadsheet again. We should have another import ready for tomorrow.

Updated by Andrew Mallis over 1 year ago

I ran another import on DEV, which has been pushed up to TEST and to LIVE. We're going to cycle another import, maybe 2 before we open up the floodgates.

We still need to create users!

I'm not sure if this is related to the import, but previously mapped nodes loose their maps on save, i.e.

http://dev.fga-directory.gotpantheon.com/occupation/us/ca/occupy-palo-alto#

I did mod the mapping to add UUID as unique and added the Type taxonomy mapping). Also noticed that custom permissions seem to be loosing their values

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from Andrew Mallis to Marco Carbone

Updated by Marco Carbone over 1 year ago

  • Assignee changed from Marco Carbone to Andrew Mallis

I was getting that same map problem on Friday, but now it's gone. I suspect that it happens when the google api is at its limit. So this won't be a problem after the final import.

Also, I had already added the type taxonomy so I went ahead and removed the duplicate row.

"Also noticed that custom permissions seem to be loosing their values"

I'd open a new ticket for this. Not sure how those were lost, but the solution would be to put these permissions into the occupations feature.

Updated by Andrew Mallis over 1 year ago

though the perms were in there.
#378 takes care of this

There is a new merge. I'm going to import it now. 1390 occupations.

Google really screwed a lot of people with their low limit. And their rates are crazy high for this service.

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from Andrew Mallis to Marco Carbone

noticed the import-export feature is out of data on DEV, now. Related?

$ drush @fga-directory.dev fr occupations_import_export

or

$ drush @fga-directory.dev fu occupations_import_export

please

Updated by Marco Carbone over 1 year ago

  • Assignee changed from Marco Carbone to Andrew Mallis

Reverted, as the override removed the occupation type field.

Re: Google -- we could try Yahoo Placefinder instead, which allows 5000 geocodes a day as compared to Google's 2500. We'd just need to add an API key here: http://dev.fga-directory.gotpantheon.com/admin/config/content/geocoder and switch the geocoder here: http://dev.fga-directory.gotpantheon.com/admin/structure/types/manage/occupation/fields/field_occupation_geo/edit

Updated by Andrew Mallis over 1 year ago

Signed up for yahoo API and switched defaut geocoder; added key.
Added verification file and pushed to DEV and verified DEV site with yahoo.
LIVE verification pending pull down there and testing of new code.

Updated by Andrew Mallis over 1 year ago

  • Status changed from Feedback to In Progress
  • Priority changed from Urgent to Normal
  • Target version changed from 0.5.0 to 0.6.0
  • % Done changed from 80 to 90

Ran our last import (for now)!

Updated by Andrew Mallis about 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

Also available in: Atom PDF